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On  the  Design  of  a  Pipeline.!/  Systolic  Finite  Elenent  Systen 


A3STRACT 


A  parallel  finite  elenent  systen  is  suggested  based  on  the  idea  of  pipe¬ 
lining  the  computations  corresponding  to  the  different  finite  elements.  The 
systolic  architecture  is  used  extensively  in  the  design  to  satisfy  a  regular  and 
smooth  flow  of  data  in  the  pipe.  Also  a  node  nunbering  algorithm  is  developed 
in  order  to  allow  for  the  application  of  a  frontal  technique  in  the  solution  of 
the  linear  system  of  equations  resulting  fron  the  analysis. 
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1.  Introduction 


In  the  past  few  years,  many  researchers  have  considered  the  use  of  some 
type  of  parallel  processing  in  finite  element  analysis.  For  instance,  Noor  and 
al  studied  algorithms  for  performing  the  analysis  on  the  CDC  Star-100  (see  e.g. 
[10]  ).  Along  the  same  line,  Kamel  and  al  [7]  studied  the  usefulness  of  array 
processors,  combined  with  mini  computers,  in  finite  element  computations.  Also, 
the  use  of  multiprocessors  in  the  solution  of  partial  differential  equations 
were  studied  (see  e.g.  [4]  ).  However,  different  experiments  showed  that  gen¬ 
eral  multiprocessors  are  not  expected  to  give  satisfactory  gain  in  the  process¬ 
ing  speed  since  the  times  for  communication  and  data  transfer  dominate  the  run¬ 
ning  time  (see  e.g.  [12]  ). 

The  most  significant  attempt  in  this  area  is  the  design  of  the  finite  ele¬ 
ment  machine  at  ICASE  [5].  In  this  project,  a  microprocessor  is  assigned  to 
each  node  in  the  finite  element  grid.  Each  processor  is  connected  to  its  eight 
immediate  neighbors,  and  all  the  processors  in  the  systems  are  connected  through 
a  global  bus.  This  machine,  however,  has  some  limitations  that  result  from  the 
direct  correspondence  between  processors  and  nodes.  In  general,  it  is  most 
suitable  if  the  interconnection!  between  its  processors  follow  the  same  pattern 
as  the  finite  element  grid. 

A  closer  study  of  the  different  steps  in  linear  finite  element  analysis 
shows  that  the  computations  may  be  divided  into  separate  phases,  where  each 
phase  depends  only  on  the  preceding  phase.  Hence  the  data  C3n  be  transferred 
from  phase  to  phase  in  a  pipelined  fashion.  The  computation  within  each  phase 
is  also  well  structured  and  mostly  compute  bound,  which  makes  it  a  suitable  can¬ 
didate  for  systolic  architectures. 


In  this  paper,  we  suggest  possible  configur ations  for  a  finite  element 
system  that  are  based  on  the  idea  of  pipelining  the  computations  associated  with 
the  different  elements  in  a  finite  element  grid.  Such  system  may  be  independent 
of  the  domain  of  any  particular  problan  and  of  the  number  of  elements  in  the 
grid  that  covers  this  domain.  Our  principal  aim  i3  to  show  that  the 
pipeline/sy3tolio  idea  may  be  a  valid  candidate  for  parallel  finite  elenent  sys¬ 
tems  rather  than  to  describe  a  'ready  to  implement'  or  an  optinal  design  for 
such  a  system,  whatever  may  be  meant  by  optimal. 

The  concept  of  systolic  operations  has  been  used  extensively  in  the  design 
to  achieve  a  regular  and  smooth  flow  of  data  within  each  functional  unit  in  the 
system.  For  the  precise  specification  of  the  inputs  and  the  outputs  of  the  dif¬ 
ferent  units,  we  use  the  notation  of  the  systolic  model  presented  in  [9],  The 
basic  idea  of  the  model  is  to  associate  with  each  communication  link  in  a  sys¬ 
tolic  network  a  data  sequence  comprising  the  data  items  that  appeared  on  this 
link  at  consecutive  time  units.  Accordingly,  the  computations  performed  by  any 
cell  in  the  network  are  expressed  in  terns  of  operators  on  sequences.  The  model 
is  quite  general  and  may  be  used  for  the  specification  and  formal  verification 
of  systolic  computations,  as  well  as  for  their  simulation.  Here,  we  only  intro¬ 
duce  some  basic  definitions  that  will  be  used  in  the  following  sections. 

Data  sequences:  A  data  sequence  is  a  mapping  from  the  set  of  positive  integers 

to  the  set  R  =R  u  { S }  ,  where  R  is  the  sat  of  real  number  and  5  i3  a  special 
0 

symbol  called  the  "don't  care"  element.  Each  communication  link  in  a  systolic 
network  is  given  a  label  of  the  form  y^  and  a  data  sequence  is  associated 

with  that  link,  that  is  the  greek  letter  n  corresponding  to  y  is  used.  The 

th 

interpretation  of  is  such  that  its  t  ‘  elenent,  r^Ct)  ,  is  the  data  item  that 


appeared  on  the  link  y^  at  tine  t,  with  n^tjsi  indicating  that  wa  do  not  care 
about  the  particular  value  at  that  tine. 

Sequence  operators:  tfe  introduce  infernally  the  following  sequence  operators: 

1)  The  shift  operator  flr  which  inserts  r  6-eienents  at  the  beginning  of  its 

2 

operand  .  For  axanple  if  n  =  a1  #a3  f  *  *  *  •  then  n  n  =  «* 6 f ^  ,a2 ,3^ f . . • . 

(• 

2)  The  spread  operator  9  which  inserts  r  6-elaments  between  every  two  succes¬ 
sive  elements  of  its  operand .  For  ax  an  pie,  9n  =  a1>6,3p,6,a^,5,... 

3)  The  piping  operator  P  has  m  operands  and  concatenates  the  first  k  elements 

m 

of  each  of  its  operands  to  form  one  long  sequence.  For  example,  if  £  = 

3 

b1  ,b2  ,b3  ,b4  ’  *  “  ’  tb3n  P2^,n^  =  bi,&2’  b3  »3i  *a?  *a3  »  6*  •••  •  T!ie  abbrevia¬ 

tion  P^  _(n3)  i3  used  far  Pk(  n1  n1™)  . 
es  i  9 n  ti 

rfe  next  specify  the  clas3  of  problems  that  may  be  solved  using  our 
pipelined/ systolic  system. 

2.  Problem  specification. 

» 

Tne  class  of  boundary  value  problems  considered  here  is  specified  by  a 
variational  formulation  of  the  following  generic  form: 

Given  a  Hilbert  space  w.  a  bilinear  operator  and  a  correspond¬ 
ing  functional  9  on  w  ,  find  the  function  $ t  W  such  that 

=  ,^(v)  for  all  vc#  ( 

tfe  restrict  ourselves  to  problems  on  a  two  dimensional  domain  Q,  and  we 

assume  that  S3  end  9  have  the  general  forms: 


-  4  - 


2  2 


'( v  ,u )  =  L  I  I  a  a  D  v  D«u  dx  dy 

3  p= o  *=o  r,c  r  r 


1 

«v) 

* 

where  ar 

(2.a> 


(2.b) 


where  a  fsa4  ,  r,tf=0,1,2  and  f  are  problen  dependent  functions,  D.  and 

r  ,c  r  ,r  12 

are  the  differential  operators  —  and  — ,  respectively,  is  the  identity 

operator  and  %f  and  are  line  integrals  over  the  boundary  3  3  of  Q.  The  foms 
of  if  and  y  depend  on  the  boundary  conditions  and  are  not  essential  to  the 
purpose  of  our  discussion.  We  also  assune  that  Q  is  covered  by  a  finite  element 
grid  that  contains  m  elenents  of  the  sane  type.  Each  element  a,  Isesn,  is 

^  0  a  q 

characterized  by  its  geometric  support  Q~ ,  by  k  nodes  on  Q  located  at  (  x“  ,  y^ )  , 
isl . Jc,  and  by  some  basis  function  associated  with  each  node. 

The  nodes  in  the  grid  may  be  labeled  by  either  a  local  or  a  global  30 heme. 
In  a  local  scheme,  each  node  in  a  certain  elenent  e  is  identified  by  a  pair 
(e,i)  for  sone  i,  Isiisk.  On  the  other  hand,  3  global  scheme  assigns  a  unique 
integer  j,  Isjsn,  to  each  node,  where  n  is  the  total  nunber  of  nodes  in  the 
grid.  The  relation  between  the  local  label  (e,i)  of  a  node  and  its  global  label 
j  is  defined  by  sone  mapping  glob  :[  1  , n]  x  [  1  ,'<]-►[  1  ,n  ]  ,  wnere  j=glob(e,i). 

Accordingly,  we  may  define  for  each  element  e  the  boolean  matrix  M  of  order 

lc*n  such  that  Me(i,j)s1  if  glob(e,i)  =  j,  and  Ms(i,j)=0  otherwise. 

Given  a  grid  that  covers  Q,  standard  techniques  nay  be  applied  to  compute 
the  finite  element  discretization  of  (1).  Moreover,  an  isoparametric  transfer- 

3  — 

nation  may  be  used  to  map  each  elenent  Q  into  a  fixed  element  Q  on  some  2- 
dimensional  space  (x,y),  and  a  nuneric  quadrature  may  be  applied  to  evaluate 
the  integral  over  Q.  Lat  q  be  the  degree  of  the  quadrature  fornula  and  denote 


f  <*  •*. 


by  (x  ,y  )e3  and  w  the  quadrature  points  and  weights,  respectively. 

S  S  3 

If  the  coefficients  a  r,/=0,1,2  and  the  load  f  are  piece-wise  con¬ 
i'  »» 

stant  functions  equal  to  a8  *  and  fe,  respectively,  on  each  element  Q8,  it  is 

r  ,  r 

well  known  that  an  approximate  solution  of  (1)  may  be  obtained  as  the  solution 
of  the  linear  system  of  equations 

H  u  =  b  (; 

where 

J_)  u  is  the  n-dimen3ional  vector  that  contains  the  values  of  the  approximate 
solution  of  (1)  at  the  n  nodes  of  the  grid. 


2)  H  is  an  nxn  banded,  symmetric,  positive  definite  stiffness  matrix.  In  order 


to  generate  H,  we  first  compute  a  kxk  elemental  matrix  for  each  element  e. 


The  entries  fC  .  •  i  =  1  ,  . . .  ,k  ,  j  s  1  ,  . . .  ,i  of  Hw  are  given  by 
1  f  J 

q 


Vj  =  rj.0  Jl  W  “rWV  B*VVy  l’  ,4> 


where  <|^(x,y)  ,  is1,...,k  are  the  basis  functions  associated  with  the  standard 


element  Q,  and  det8  is  the  determinant  of  the  mapping  3a— *Q.  Each  elemental 


matrix  H  that  corresponds  to  a  boundary  element  is  then  modified  by  the  addi- 

o 


tion  of  a  sparse  matrix  SS  that  is  computed  from  the  term  in  (2. a)  .  The 


resulting  elemental  matrices,  H"s=de+58,  e=1,...,m  are  finally  assembled  into 


the  global  matrix  H  according  to  H  =  \  MeT  H' 

e=1 


b  is  an  n-dlmensional  global  vector  generated  by  first  computing  the  elenen- 


e 


tal  vectors  b  from 


f  )  w  det^C  x  ,y  )  4>.  ( x  ,y  ) 

L.  g  g  g  1  s  g 


g  "g 


1—1  I  •  «  • 


Similar  to  the  elemental  matrices,  each  b~  corresponding  to  a  boundary  elenent 

0 

is  modified  by  the  addition  of  a  vector  s  and  the  resulting  vectors,  b  , 


eT  -e 

e=1,...,m  are  then  assembled  into  b  according  to  b  =  J  M  b 


In  the  previous  discussion,  it  was  assumed  that  $  is  a  real-valued  func¬ 
tion.  It  should  be  noted,  however,  that  the  sane  formulas  are  valid  for  func¬ 
tions  with  d>1  degrees  of  freedom.  In  this  case  a  and  f  are  dxd  matrices 

f  1 

0 

and  d-dimensional  vectors,  respectively,  and  the  entries  of  the  M  natrioes  are 
dxd  unit  matrices  and  zero  matrices  instead  of  ones  and  zeroes,  respectively. 

In  the  remainder  of  this  paper,  we  will  briefly  describe  the  organization 
of  a  complete  pi  pel  ined/ systolic  finite  elenent  system  for  the  above  class  of 
problems.  By  its  very  nature,  any  systolic  or  pipelined  system  needs  to  be  mon¬ 
itored  by  a  host  computer.  In  our  system,  the  host  is  assumed  to  be  a  general 
purpose  computer  that  contains  the  data  base  for  the  problem  and  constitutes  the 
only  means  of  communication  between  the  user  and  the  system.  It  is  responsible 
for  setting,  initiating  and  feeding  the  systolic  pipe  with  the  appropriate  data 
as  well  as  collecting  the  outputs. 

Tne  configuration  of  the  entire  system  depends  on  the  method  used  for  the 
solution  of  the  linear  system  of  equations  (3),  namely,  a  direct  or  an  iterative 
method.  >fe  will  consider  systems  with  different  types  of  solvers  separately. 
However,  we  start  by  discussing  a  functional  unit  that  should  be  included  in  any 


finite  elenent  system,  namely  a  unit  for  the  generation  of  elemental  arrays. 


I.  The  generation  of  eleaental  arrays. 

The  systen  proposed  for  the  generation  of  the  elemental  arrays  is  composed 
of  six  functional  units.  These  units,  denoted  in  Figure  1  by  Ml,...,  N6,  are 
implemented  using  systolic  components  and  are  connected  in  a  cascade  such  that 
the  output  of  one  unit  is  the  input  to  the  next  unit.  In  order  to  compute  the 
elanental  arrays  correspond ing  to  a  certain  elanent  e,  the  system  should  be  sup- 

»  .  g 

plied  by  1)  the  values  of  the  coefficients  a  r,r=0,l,2  and  the  load  f  on 

i"  »* 

3  e 

element  e,  and  2)  the  coordinates  of  the  Sc  nodes  (x^,y^),  is  1  . 


Figure  1  -  Pipelined  generation  of  the  elemental  arrays 

In  addition  to  the  above  data  that  are  dependent  on  the  specific  element 
being  processed,  the  system  should  also  be  supplied  with  the  values  of  the  basis 

functions  $  ,  is1,...,’<,  and  their  derivatives  34^/ 3x  and  34^/ ay  at  the  qua¬ 
drature  points  (x  ,y  )  ,  g=  1  ,  . . .  ,q  .  These  values  are  independent  of  any  par- 
S  S 

ticular  elanent  and  hence  maybe  preloaded  into  the  local  memory  W1  of  the  sys¬ 
tem  and  used  repeatedly  during  the  computation. 


The  precise  specification  of  each  computational  unit  and  a  formal  verifi¬ 
cation  of  its  operation  is  given  in  details  in  [3].  Here,  we  only  present  a 


brief  description  of  the  function  of  each  unit. 


The  unit  N1  is  composed  of  2q  '  mul  tipi y/ add'  systolic  cells.  Its  function 
is  to  compute  the  elements  of  the  Jacobian  of  the  isoparametric  transformation 

Qe+  Q.  The  coordinates  of  the  k  nodes  of  a  certain  element  e  are  supplied  to 

Ml  through  the  input  links  z,  ,  and  z_  ,  and  the  values  of  ?,  (  x  ,  y  )  , 

1,1  2,1  i  i  g 

3  IT  (x  ,  y  )/3x,  and  3?.  (  x  ,  y  )/3y  are  supplied  from  the  local  memory  Li  1. 
i  g  g  i  g  g 

The  exact  specification  of  the  inputs  on  z  and  z  is 

1  f  1  d  9  I 


C1,1  =  9  n' 


C2.  1  *  “  9 


where  for  t£k,  n’(t)  =  y®  and  £S(t)  =  x~,  and  for  t>k,  n“( t)  =  (t)  =  6.  That  is, 

the  x-coord inates  of  the  k  nodes  are  supplied  to  the  system  on  z  ,  starting 

■  f  ■ 

at  time  one  and  separated  from  each  other  by  two  tine  units.  Similarly,  the  y- 
coord inates  are  supplied  on  z2  1  starting  at  time  two  and  separated  from  each 


other  by  two  tine  units. 


The  computed  Jacobian  and  the  basis  functions  are  then  passed  to  N2  which 
is  also  composed  of  2q  'multiply/ add'  cells.  Its  task  is  to  compute  the  values 

of  ?T(g)s  D  H.  (x  ,y  )  ,  rsO,  1,  2,  i  =  1  ,  ...  ,k  and  g=1  ,  ...  ,q  .  These  values  are 
i  r  l  g  g 

0 

then  passed  to  M3  which  computes  the  determinant  det  of  the  transformation  and 
the  valu33  of  vj*(g)  =  w  dete(Xg,yg)  V^(g)  . 

The  unit  M4  is  conposed  of  3kq  'multiply/ add'  cells  connected  as  a  qx3k 

-r  t 

rectangular  array.  It  receives  the  values  of  V^(g)  and  V^Cg)  ^*an  N3«  3n'1 

computes  the  integrals  that  appear  in  equation  (4),  namely 
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£v 
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v  .-. 

kV-  V 
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P  * 
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-  9  - 


.e  ,r  ,1 

i  ,j 


q  _  f 

l  vj* ca)  *:<g>  . 

g=i  J 


i  » j  =  1  ,  •  • »  ,  <  t  r  s  D  ,  1,  2 , 


(7) 


Tne  final  step  in  the  generation  of  H2  is  the  multipl ioation  of  each 
integral  by  the  corresponding  coefficient  a2  g  and  the  conputation  of  the  sum 


h: 


i  , 


V  a®  y  ~  ft 

d  i)  1 1  . 


j  =  r  jl-  9  ^  i'J 


art 

This  step  is  performed  by  N5  which  gets  Y.  * .  '  fron  N4  and  receives  a  0  from 

1  »J  r  ,t 

the  host  on  the  input  links  p  ,  s  =  0,  1,2.  In  order  to  ensure  that  each  coeffi- 

3 

ciant  meets  the  corresponding  integral  at  the  right  time,  the  inputs  should  be 
supplied  according  to 

e 

(3) 


ir  =  aq+3!c  +  9  ae 

S  3 


■where,  for  ts3,  a^Ct)  =  a^?3  t*  with  Q  denoting  the  modulo  3  addition. 


__  a 

The  elements  of  the  symmetric  matrix  H~  are  produced  on  k  output  links, 
namely  2U  ^  +  u  =  0,...,k-1  (sea  Fig  1).  More  precisely,  the  elements  of  the 

th  e 

u  off  diagonal  of  H  appear  on  z  after  5m-q+3k+16  time  units  from  the 

U  ?4  +  4 

initiation  of  the  operation  of  the  system,  separated  from  each  other  by  two  tine 
units.  Tnis  is  formally  described  by 


where 


H, 


t  ,  t  +  u 


k.; 

I 

V 


M 

h 

m : 


u,q+'4 


ITU)  = 


Q6u  +  q+  3k  >  1  6  Q2  ua 
u 


'J“  0  f  •••  9  <C  ■“  1  . 


(9. a) 


[ 


if  t*k-u 


if  t>  k-u  . 


The  function  of  N6  is  the  generation  of  the  elanental  load  vectors.  It 


•1 


*—  utAtUo.  «'■  i-  »L. »~-V-  1L.V.V.V- V. V,  V.  V.  V.V 


•«..:»  i-.-i-i.aan, 
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$ 

link  r 

H 

ing  to 

receives  the  values  of  ^(g)  fVon  N4  and  the  load  f  from  the  host  on  the  input 

a 

link  r  ,  and  produces  the  elements  of  b’  on  the  output  link  z,  accord¬ 
ed  9  d  »3 ♦  ■* 


,_q*9k-t-1  6  .2  e 

Ck,q*4  =  a  9  uk  (9,b) 

8  0  8 
where,  for  t£k,  u^(t)  =  bt,  and  for  t>k,  u^(t)sS. 


In  sunmary,  equations  (9.a/b)  indicate  that  the  system  completes  the  com¬ 
putation  of  one  elemental  matrix  and  vector  in  12vt*q*16  time  units,  where  a  time 
unit  is  basically  the  time  required  to  perform  either  a  '  multiply/ add'  or  a 
•divide'  operation,  whichever  is  larger.  Although  this  is  a  noticeable  speed  up 
over  the  serial  execution  of  the  operations  involved  in  the  computation,  the 
system  suggested  here  ha3  two  other  important  merits,  namely 

J_)  The  smooth  movement  of  data  such  that  each  data  item  arrives  at  the  proper 
cell  When  it  is  needed.  This  eliminates  any  delay  in  execution  due  to  compli¬ 
cated  interprocess  communication  or  slow  memory  fetch. 

2)  The  ability  to  pipeline  the  computation  corresponding  to  the  different  ele¬ 
ments  on  the  system.  More  specifically,  if  the  input  data  for  the  different 
elements  are  pipelined  at  the  rate  of  the  data  for  one  element  every  3k  tine 
units,  then  the  results  will  be  produced  at  the  3a.me  rata  of  one  elenental 
matrix/ vector  every  3k  time  units.  This  pipelining  of  data  may  be  described 
formally  in  terms  of  the  piping  operator  of  Saction  1.  Namely,  if  the  inputs 
are  described  by 

'i,i  ■  p?=i,«‘e2 

4,1  !  1  p« i,,'®2  "3> 


.  a’*3**9  p*  „<e2  <.;> 
e=  1  ,.n  3 


s=D,  1,  2 


•  .  -  *  .  •>VW  . ■'  •  ■  .  .’•  *  *  •,**  .  ,  *  v'  .  -  . 

*  V*  **  •  •  K.'  %.*  V*  O  O  O.  O  V.'  *v 


then  it  uay  be  proved  formally  that  the  output  of  the  systeu  is  described  by 


'u.q  +  iJ 

3  e  e 
where  £  ,  n  •  a 
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6u>q*3k«-1  6  3*  ,2  e  . 

0  Pe=1  ,  „ C  6  »u  > 

and  u®  are  as  in  (6),  (3)  and  (3). 


u=  3  ,  ...  ,l< 


Although  the  tern  *  efficiency'  is  not  precisely  defined  for  systolio  net¬ 
works,  we  may  roughly  estimate  the  utilization  of  such  networks  by  calculating 
the  ratio  U  of  the  average  nunber  of  cycles  during  which  each  cell  in  the  system 
is  doing  useful  work  to  the  total  nunber  of  cycles  needed  to  conplete  the  execu¬ 
tion.  For  the  six  networks  N1,...,N6,  it  may  be  shown  that  U  is  larger  than 
50%.  This  means  that  we  are  using,  on  the  average,  more  than  half  of  our 
resources,  which  is  a  relatively  high  utilization  in  parallel  computation. 

4.  The  effect  of  boundary  conditions. 

The  next  step  in  the  analysis  is  the  modification  of  the  entries  of  the 
6  6 

elemental  arrays  H  and  b  ,  where  necessary,  to  account  for  the  boundary  condi¬ 
tions.  We  consider  here  two  types  of  modifications,  namely 

Modifications  needed  to  force  the  solution  to  zero  at  some  specified  nodes  of 
the  grid.  More  specifically,  in  order  to  force  the  solution  to  zero  at  a  cer- 

e  e  a 

tain  node  r ,  we  may  set  b.  =0  ,  H.  .  =3  ,  js  1  ,  . ..  ,k  ,  j  *i  ,  and  H.  .  =1  ,  for  each  a 

i  i  ,j  l  ,i 

j.  u 

and  i  satisfying  glob(e,i)  =  r.  This  is  equivalent  with  the  replacement  the  r 

equation  in  the  linear  system  (3)  by  u  =0  and  thus  guarantees  that  the  solu- 

r 

tion  is  zero  at  node  r. 

2)  Modifications  that  results  from  the  so  called  "essential  boundary  condi¬ 
tions".  As  mentioned  earlier,  this  type  of  conditions  may  be  accounted  for  by 

Q  a  q  0 

the  addition  of  a  sparse  matrix  S  and  vector  s  to  H”  and  b  ,  respectively. 


Also,  each  entry  in  the  modified  arrays  H  and  b  should  be  associated 


with  the  corresponding  global  labels  in  preparation  for  the  assenbly  3tage. 


More  specifically,  each  H.  .  should  be  associated  with  glob(e,i)  and  glob(e,j). 


re 


and  each  b^  should  be  associated  with  glob(e,i). 


% 

S. 


Figure  2  -  Addition  of  the  effect  of  boundary  conditions . 


The  unit  labeled  N7  in  figure  2  is  responsible  for  the  first  modification. 


It  is  connected  to  N5  and  N6  and  receives  froa  them  the  elenents  of  H~  and  b~ , 
3s1,...,m,  respectively.  It  also  receives  from  the  host  1)  the  k  global  labels 
glob(e,i),  is1,...,k  of  the  nodes  in  each  element  e,  and  2)  for  each  node  (e,i) 
a  single  bit  that  is  set  to  one  only  if  the  solution  at  (e,i)  is  to  be  forced  to 
zero.  Tnis  information  is  supplied  on  the  input  links  sQ  and  s^  according  to 


a 


i 


1  6 


i  =  0  ,  1 


Q 

where  the  elements  of  are  described,  for  tsk,  by 


Y;<t) 

= 

glob( e,t) 

i 

r 1 

if  the  solution  at  node  (  e,t)  is  forced  to  zero 

y0e(t) 

=  < 

\ 

1 

i 

•  0 

otherwise 

» 

The  unit  N8  is  responsible  for  the  addition  of  the  correction  arrays  S 
and  s8  to  the  elemental  matrices  He  and  b3 .  However,  because  most  of  the 


arrays  S'  and  s'  e=1,...,m  are  zero  arrays,  and  if  non  zero,  they  contain  only 
few  non  zero  entries,  the  addition  of  special  hardware  for  tha  computation  of 
these  non  zaro  entries  cannot  be  justified.  More  appropriately,  these  faw 
entries  may  be  computed  by  the  host  and  preloaded  into  a  local  tnenory  (LM2),  and 
some  logic  may  be  built  into  N8  to  retrieve  these  entries  wnen  the  corresponding 
entries  of  the  elemental  arrays  are  received  from  M7.  Detailed  implementations 
for  N7  and  N8  are  given  in  [3]. 


5.  Systeas  that  employ  direct  solvers 


In  Figure  3.  we  show  a  block  diagran  of  a  complete  system  that  uses  an  LU 
decomposition  for  solving  (3).  It  consists  of  tha  ho3t  and  four  functional 
units.  The  unit  labeled  GEM  is  the  generator  of  the  elemental  arrays  as 
described  in  some  detail  in  sections  3  and  4.  The  output  of  GEN  i3  then 
directed  into  the  unit  labeled  ASSEMB.  Its  function  is  to  assemble  the  global 
arrays  H  and  b.  The  third  unit,  FACT,  receives  H  and  b  from  ASSEMB  and  simul¬ 
taneously  performs  the  LU  factorization  and  produces  the  solution  y  of  Ly=b. 
Finally,  the  unit  BACK  solves  the  triangular  system  Uu=y  by  back  substitution. 


Figure  3  -  A  system  that  employs  a  direct  solver. 

In  order  to  simplify  the  discussion,  we  consider  only  the  assembly  of  H 

—a  th  — e  th 

and  we  denote  by  h.  the  i  row  of  the  symmetric  matrix  H  and  by  h.  the  i 


row  of  the  symmetric  global  matrix  H.  <fe  also  note  that  ASSEdB  receives  from 
GEN  the  elemental  matrices  pipelined  in  the  given  order.  Tie  row3 

within  each  matrix  H  are  pipelined  in  the  order  h1t...,h  .  Tnis  order  is 
determined  by  the  local  labels  given  to  the  nodes  of  the  grid. 

The  labels  es1,...,m  given  to  the  elenents  of  the  grid  are  of  particular 
interest  to  us.  If  the  element  labels  satisfy  the  property  that  any  element  e, 
isesra,  contains  at  least  one  node  that  does  not  belong  to  any  element 
then  we  call  such  a  labeling  scheme  a  proper  elenent  labeling.  The 
importance  of  proper  element  labeling  will  be  apparent  later. 

Each  element  .  is  received  by  ASSC4B  accompanied  by  the  global  labels 

1  f  J 

glob(e,i)  and  glob(e.j)  that  specify  the  position  at  which  it  should  be  accumu¬ 
lated  in  H.  Hore  precisely,  assuming  that  a  band  storage  scheme  is  used  for  H, 

_  A 

ASSEM8  accumulates  H’  .  in  row  glob(e,i)  of  H,  and  in  the  off-diagonal  position 

1  #J 

|glob(e,l)  -  glob(e,j)|. 

The  assembled  rows  of  H  are  then  passed  to  FACT  that  proceeds  with  the  LU 
factorization.  Here,  a  frontal  technique  is  naturally  U3ed  to  achieve  two 
important  goals:  1)  To  allow  FACT  and  ASSEHB  to  execute  in  parallel,  2)  To 
minimize  the  storage  requirement  of  ASSEdB. 

In  order  to  be  more  specific,  we  introduce  some  terminology.  During  the 
assembly  process,  a  row  hj^  is  said  to  be  active  from  the  moment  of  the  appear¬ 
ance  of  a  row  h®  with  glob(  e,j)  =  i,  that  is  from  the  time  when  its  assembly 
actually  starts.  On  the  other  hand,  h  is  called  a  ready  row  immediately  after 


the  accumulation  of  the  last  row  h£  with  glob(r,P)  =  i,  that  is  after  its  assembly 

has  been  completed.  In  other  words,  a  certain  row  of  H  is  partially  assenbled 
when  it  is  active  but  not  yet  ready.  Once  ready,  a  row  nay  be  passed  fron 
ASSEM8  to  FACT  and  its  storage  in  ASSEMB  nay  be  released. 

However,  in  all  the  known  parallel  schemes  for  the  direct  solution  of 
Hu=b,  the  rows  of  H  have  to  be  processed  in  a  sequential  order,  wnioh  neans  that 
the  ready  rows  of  H  should  be  produced  by  ASSE<3  in  sequential  order.  For¬ 
tunately,  we  nay  satisfy  this  restriction  by  assigning  appropriately  global 
labels  to  the  nodes.  The  following  node  numbering  algorithn  takes  this  restric¬ 
tion  into  consideration. 


ALG1: 

Given  a  proper  elenent  labeling  for  the  finite  elenents  and  a  corresponding 
local  .  numbering  of  the  nodes,  obtain  the  global  numbering  by  giving  the 
nodes  sequential  nunbers  in  the  following  order: 

1)  FOR  j=1 , ...  ,k  DO  glob(l.j)  =  j 

2)  FOR  e=2 . m  DO 

FOR  i*l,...,k  DO  IF  node  (e,i)  is  already  numbered  THEM  skip 

ELSE  increase  j  by  one  and  set  glob(e,i)  =  j. 


Mow,  assune  that  the  nodes  are  numbered  by  ALG1  and  that  the  bandwidth  of 
the  matrix  resulting  from  this  particular  numbering  is  23+1.  Than  it  may  be 
proved  [3]  that,  during  the  assembly  process,  the  rows  of  H  become  active  in  a 
purely  sequential  order.  Moreover,  whenever  a  certain  row  i  of  H  is  active  than 
the  rows  up  to  i-3-1  are  ready  and  may  be  processed  by  the  solver. 

Definition:  If,  at  a  specific  time  during  the  assembly  of  H,  a  certain  row  i, 
ISiSn  is  active,  then  the  rows  1,  ...  ,1-3-1  3re  called  B  ready  rows  of  H. 


-V  AAA  A  .U.V  j  •_  ■  j  V 


Fro.n  the  above  discussion,  it  follows  that  B_ready  rows  are  also  ready 
rows,  and  that  the  rows  of  H  becone  B_ready  in  a  purely  sequential  order.  How¬ 
ever,  a  given  row  .nay  be  ready  oefore  it  becones  B_reaJy.  Being  pessin istio ,  we 
will  pass  only  B_ready  row3  from  ASSEMB  to  FACT,  except  of  course  the  last  rows 
n-3,...,n  that  nay  be  passed  to  FACT  only  after  the  assembly  of  H  Is  conpleted. 

In  addition  to  satisfying  the  two  goals  stated  earlier,  the  above  rule  for 
the  interaction  between  ASSEMB  and  FACT  allows  ASSEMB  to  determine  automatically 
the  instant  at  which  a  row  is  ready  to  be  passed  to  FACT.  Thi3  eliminates  the 
preprocessing  step  that  is  usually  needed  in  frontal  techniques  to  determine  the 
instant  at  which  a  certain  row  is  ready.  More  precisely,  ASSEMB  nay  keep  a  flag 
"BMAX"  that  indicates  that  any  row  h^  ,  is  BMAX  is  B_ready  (and  hence  ready). 

This  flag  should  be  updated  whenever  a  row  h~  with  glob(  e ,  j )  -3  -1  >3MAX  13 
received . 

We  now  return  to  ALG1.  Although  this  algorithm  provides  a  good  nunbering 
scheme  from  the  point  of  view  of  processing  the  assembly  and  the  solution 
processes  in  parallel,  we  still  have  to  ensure  that  it  does  not  result  in  a 
large  bandwidth  B.  For  this  we  note  that  ALG1  is  a  two  step  algorithm;  First, 
the  elements  are  labeled,  and  then  the  nodes  within  the  elements  are  numbered. 
To  our  knowledge,  Fenves  and  Law  [3]  were  the  first  to  suggest  a  two  step 
nunbering  schame.  They  reported  experimental  results  which  show  that  if  the 
Cuthill-Mckee  [2]  algorithm  is  used  to  number  the  elements  in  a  two  step  algo¬ 
rithm,  then  the  bandwidth  of  the  resulting  matrix  is  comparable  with  the  bast 
known  algorithm  for  minimizing  the  bandwidth.  Here,  in  the  application  of  the 
Cuthill-Mckee  algorithm,  two  elements  are  considered  neighbors  if  they  share  a 
common  boundary,  rfe  examined  many  strange  shapes  of  meshes  and  in  only  rare 
cases  did  the  application  of  the  Cuthill-Mckee  algorithm  for  numbering  the 
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elements  result  in  a  non  proper  element  labeling.  Moreover,  in  all  these  rare 
oases,  a  proper  labeling  was  easily  obtained  by  changing  the  starting  element. 
The  existence  and  construction  of  a  proper  element  labeling  scheme  for  a  given 
mesh  is  still  a  question  that  needs  to  be  answered. 

Possible  impl mentation s  for  ISSEflB  and  FACT. 

It  is  clear  that  ASSEM8  has  to  handle  large  aaount  of  data  at  high  rates, 
which  necessitates  the  distribution  of  its  task  on  a  nunber  of  processors,  each 
being  responsible  for  the  assembly  of  the  elanents  in  one  or  sore  diagonals  of 
H.  We  consider  here  the  extrene  case  where  we  have  B+1  processor/ memory  units 

L  U 

W0,...,P<B  with  PMw  responsible  for  the  assembly  of  the  w  off-diagonal  of 
H.  A  communication  network,  COMM,  is  needed  to  distribute  the  data  received  by 

ASSEMB  to  the  appropriate  processor.  Namely,  when  an  element  Hf  is  received, 

1  tJ 

accompanied  with  glob(e,i)  and  glob(e,j),  COMM  3hould  direct  these  data  to  Ww, 
where  w=  |glob(  e,i) -glob(  e,j)  \ .  It  may  be  implemented  as  a  binary  tree  network, 
where  each  node  is  a  switch  that  uses  the  appropriate  bit  of  w  to  decide  whether 
to  pass  the  information  to  its  left  or  right  successor. 

In  the  implementation  suggested  in  [3],  the  process  executed  by  each  Ww 
is  driven  by  two  interrupts,  namely  input  and  output  interrupts.  The  input 

interrupt  takes  place  when  new  data  ( typically  H?  ,  and  glob(e,i))  are  received 

1  *J 

from  CCMM.  As  a  result,  H.  .  is  accumulated  in  the  position  v=glob(e,i)  of  the 

*  tJ 

diagonal  stored  in  Mw  and  the  flag  RMAX  is  set  to  max{3MAX, v-3-1 } .  Upon 
reception  of  all  the  elemental  arrays,  BMAX  is  set  to  n  to  indicate  that  any  row 
in  H  may  then  be  passed  to  FACT.  On  the  other  hand,  the  output  interrupt  is  of 
a  lower  priority  and  takes  place  when  the  output  port  connected  to  FACT  is  ready 
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to  receive  new  data.  A  comter  "consumed_w"  .nay  be  used  to  keep  track  of  the 
next  element  that  should  be  produced.  However,  if  thi3  element  is  in  a  row  that 
is  not  yat  B_ready  then  Pttw  would  have  to  wait  until  consumed_w  s  BMAX  and  than 
pass  the  element  to  FACT. 

One  difficulty  arises  from  the  distribution  of  the  assembly  process  on  tha 
B+1  PM'  s,  namely  that  upon  receipt  of  a  certain  row  ,  the  entries  of  this  row 
are  distributed  to  the  few  PM’ s  that  are  responsible  for  their  accunulation. 

~  Q 

Hance,  only  these  few  fld*  s  will  detect  tha  arrival  of  and  updata  accordingly 

their  copy  of  the  variable  BMAX.  The  copies  of  3MAX  in  tha  other  PM' s  will  not 
be  updated  unless  wa  provide  for  some  sort  of  interprocess  communication.  In 
order  to  solve  this  problem,  wa  may  store  BMAX  in  a  global  location  shared  by 
all  W's.  However,  since  WO  receives  the  diagonal  element  of  every  row  arriv¬ 
ing  at  ASSEdB,  it  seems  natural  to  have  only  PM  0  jpdata  BMAX.  Another  solution 
is  to  use  a  message  passing  technique  where  tha  updated  value  of  BMAX  i3  passed 
from  PMO  to  W1  to  PM  2,  ....  and  so  on. 

ifith  the  above  implementation  of  ASSdB,  FACT  should  have  enough  computing 
po war  to  process  the  ready  row3  of  H  at  the  high  rata  at  which  thay  may  be  pro¬ 
duced  by  ASSEMB.  This  powar  may  ba  obtained  from  a  very  high  speed  array  pro¬ 
cessor  that  may  become  available  in  the  future  as  a  result  of  advances  in  VLSI 
and  optical  communication  technologies.  However,  with  the  current  technology, 
the  most  suitable  candidates  for  the  implementation  of  FACT  are  systolic  arrays. 

Many  systolic  networks  have  been  suggested  in  the  literature  for  the  LU 
decomposition  of  positive  definite  banded  matrices  (e.g.  [5]  ),  and  specifically 
for  symmetric  matrices  [1,8].  They  all  require  that  the  elements  of  the  matrix 
be  supplied  diagonal-wise,  which  is  compatible  with  the  above  stricture  of 
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ASSEM3.  However,  any  implementation  of  FACT  nay  not  be  synchronized  by  a  global 
clock,  naaely  because  the  rate  at  which  ASSEdB  produces  the  B_ready  rows  of  H  is 
not  constant  due  to  the  nature  of  the  asse.nbly  process.  Hence,  a  self-timed 
technique  [11]  should  be  used  for  the  interaction  between  AS3E3B  and  FACT  and 
for  the  synchronization  of  the  operation  of  FACT. 

In  order  to  study  the  utilization  of  FACT  in  this  self-tined  environment, 

we  define  r  to  be  the  rate  at  which  FACT  would  consume  the  ready  rows  of  H  if 
0 

they  were  always  available  when  needed.  This  rate  for  the  networks  described  in 
[1,6,81  is  one  row  of  H  every  3  tine  units,  that  is  r^  =  1/3.  tfe  also  define 

r^(  t)  to  be  the  rate  at  which  ASSEMB  produces  the  B  ready  rows  of  H  at  any  par¬ 
ticular  tine  t.  For  the  inplementations  of  GSM  and  ASSEM B  discussed  here,  it 
may  be  shown  that,  at  any  tine  t,  rp(t)«S1  /3  row/tine  unit.  Tnis  means  that  the 

Bjready  rows  of  H  will  be  consumed  by  FACT  as  soon  as  they  are  produced  by 
ASSE13.  This  result  has  the  fallowing  Inpl ications: 

J_)  The  storage  in  each  Ww  should  be  large  enough  to  store  only  B+1  alenents  of 
th 

the  w  off-diagonal  of  H  at  a  tine. 

2)  The  generation  and  assembly  of  H  form  the  bottle  neck  of  the  entire  system. 
This  is  a  clear  indication  that  we  should  not  concentrate  our  effort  on  finding 
faster  parallel  solvers  for  linear  systems  and  neglect  the  problem  of  generating 
H  and  b  fast  enough  to  feed  these  fast  solvers. 

j[)  Although  FACT  is  self-timed,  we  may  accurately  estimate  the  tine  at  which  it 
will  complete  its  task.  For  example,  if  the  systolic  network  in  [3]  is  used  for 
FACT,  then  it  may  be  shown  that  the  decomposition  will  be  completed  one  time 
unit  after  FACT  receives  the  last  row  of  H.  Moreover,  the  last  B  row3  of  H  are 
made  available  to  FACT  just  after  ASSEdB  receives  its  last  input  at  tine 
3km*9k*q*1  6.  Since  FACT  is  able  to  consume  these  B  row3  in  33  tine  units,  we 


may  conclude  that  it  will  teminate  its  execution  at  tine  3k.in-9-C'*-33+q4-1  7  z 
3  0cm*3)  . 

Finally,  we  discuss  the  last  unit  in  Figure  3.  This  unit,  BACK,  performs 
the  back  substitution  step.  Although  its  task  is  simple,  BACK  cannot  start  it3 
computation  before  the  last  row  of  L  and  the  last  elenent  of  y  are  available. 
Hence,  a  temporary  storage,  TEMP,  must  be  provided  for  storing  the  elements  of  L 
and  y  upon  their  generation  until  FACT  terminates  its  execution.  Note  that  the 
systolic  network  for  back  substitution  described  in  [6]  may  be  U3ed  for  BACK. 
This  needs  2i  time  units  to  execute,  and  hence  the  entire  analysis  will  be  com¬ 
pleted  in  approximately  3(km>3)+2n  time  units,  which  is  a  considerable  speed  up 

2 

over  the  time  for  serially  executing  the  0(n  )  operations  involved  in  the 
analysis. 

Although  the  system  described  above  profits  from  all  apparent  concurren¬ 
cies  in  the  analysis,  it  has  a  serious  disadvantage,  namely  the  dependency  of 
its  architecture  on  the  bandwidth  Bof  the  matrix  H.  In  order  to  be  able  to  use 
a  system  designed  for  a  certain  bandwidth  B  for  a  problem  with  a  larger 
bandwidth,  we  should  be  able  to  partition  the  computation  appropr iately  to  allow 
its  execution  on  the  existing  hardware.  This  partitioning  seems  to  be  non¬ 
trivial  for  systolic  LU  decomposition  networks  due  to  their  complex  communica¬ 
tion  patterns,  ‘tore  research  is  needed  on  systolic  or  alternate  architectures 
for  the  direct  solutions  of  linear  systems  if  we  desire  to  have  a  system  that  is 
independent  of  the  bandwidth  B. 

6.  Srataas  that  eaploy  iterative  solvers. 

Direct  solution  schemes  for  the  linear  system  (3)  do  not  take  advantage  of 
the  fact  that  H  is  highly  sparse,  thus  missing  a  potential  for  savings  in  both 


storage  and  execution  tine.  For  this  reason,  it  is  sometimes  beneficial  to  use 
iterative  schemes  for  the  solution  of  (3)  despite  their  obvious  disadvantages, 
namely,  the  absence  of  a  good  criterion  for  chasing  the  initial  point  anl  the 
possible  divergence  or  slow  convergence  of  the  iteration. 


Many  iterative  schemes  exist  for  the  solution  of  Hu=b.  Hare,  we  restrict 
ourselves  to  those  schemes  that  involves  the  matrix  H  only  in  the  computation  of 
its  product  with  a  certain  vector.  This  product  may  be  formed  using  the 
unassembled  elemental  arrays,  thereby  eliminating  the  need  for  the  irregular 
assembly  stage.  More  specifically,  the  product  of  H  with  any  vector  p  may  be 
computed  from 

Hp  =  l  MeT  H3  M®  p  =  l  MeT  H3  p®. 
e=  1  e=1 

This  multiplication  scheme  is  attractive  in  our  case  because  the  partial  pro- 
duct3  H  p  for  e=1,...,m  may  be  pipelined  at  the  sane  rate  at  which  the 

—a 

arrays  H”  are  generated . 

In  Figure  4,  we  show  a  block  diagram  of  a  systolic  system  that  employs 
iterative  solvers.  It  is  composed  of  the  host  and  two  systolic  functional 
units;  namely  GEM  for  the  generation  of  the  elenent3l  arrays  and  MULT  for  the 
matrix/ vector  multiplication.  In  this  system,  the  host  is  more  involved  in  the 


computation  than  in  the  system  that  employs  direct  solvers.  In  fact,  the  host 
is  a  general  purpose  computer  that  executes  a  sequential  finite  elenent  program 
and  uses  the  systolic  units  GEN  and  MULT  as  high  speej  devices  to  perforn  some 
compute-bound  operations  in  the  program. 

The  block  labeled  STORE  in  Figure  4  is  a  storage  device  used  to  store  the 
elemental  arrays  in  order  to  use  then  in  successive  iteration  steps.  However, 
if  the  speed  of  STORE  is  such  that  it  cannot  provide  MULT  with  the  elemental 
arrays  at  the  required  high  rata,  then  STORE  nay  be  eliminated  from  the  sy3ten , 
and  GEN  may  be  used  to  regenerate  the  elemental  arrays  in  each  step  of  the 
iteration.  This  idea  of  regenerating  the  elemental  arrays  is  more  attractive  if 
a  multigrid  technique  i3  used  for  the  solution  of  Hu=b.  In  that  case,  the 
regeneration  of  the  elemental  arrays  becones  an  essential  operation.  Note  that 
the  architectures  of  CEN  and  MULT  neither  depend  on  the  specific  grid  that  cover 
the  domain  of  the  problem  nor  on  the  bandwidth  of  the  resulting  matrix  H. 
Hence,  the  matrices  corresponding  to  the  different  grids  may  be  generated  from 
GEN  without  any  system  reconfigur ation. 

Finally,  we  note  that  only  limited  spaed  up  may  be  obtained  by  using  the 
pipelined  idea  with  iterative  solvers.  This  is  namely  due  to  the  fact  that  suc¬ 
cessive  steps  in  commonly  used  iterative  solvers  cannot  be  pipelined.  For  exam¬ 
ple,  in  the  conjugate  gradient  method,  a  new  step  cannot  be  initiated  before  the 
termination  of  a  dot  product  that  depends  on  the  previous  iterate.  It  seems 
that  the  success  of  a  pipelined/ iterative  finite  element  system  is  largely 
dependent  on  the  availability  of  an  iterative  solution  scheme  in  which  succes¬ 
sive  steps  may  be  pipelined. 
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7 .  Conclusion. 

Pipelining  is  a  straight  forward  approach  for  processing  a  certain  compu¬ 
tation  in  paralleL  on  some  hardware  that  is  not  dependent  on  the  size  of  the 
problem  to  be  solved.  In  this  paper,  we  discussed  pipelined  solutions  to  linear 
finite  element  problems.  In  order  to  maximize  the  benefits  from  a  pipelined 
system,  the  execution  time  of  the  different  stages  in  the  pipe  should  be  approx¬ 
imately  equal.  Tnis  may  be  accomplished  by  including  in  each  functional  unit  in 
the  pipe  an  amount  of  hardware  proportional  to  the  computation  performed  by  that 
unit.  Unfortunately,  in  the  case  of  complex  computations  as  that  involved  in 
finite  element  analysis,  this  means  that  the  architecture  of  each  unit  may 
depend  on  some  parameters  of  the  given  problem.  For  instance,  the  LU  factoriza¬ 
tion  unit  depends  on  the  bandwidth  Bof  the  stiffness  matrix  and  the  array  gen¬ 
eration  units  depend  on  the  number  of  nodes  k  in  each  finite  element.  Conse¬ 
quently,  a  system  designed  far  a  certain  B  and  k  cannot  be  used  for  problems 
with  B ' >3  or  k’  >k. 

These  restrictions  on  B  and  k  result  from  the  use  of  systolic  networks 
with  very  simple  types  of  cells  for  the  majority  of  the  functional  units  in  the 
pipe.  This  has  the  advantage  of  achieving  a  smooth  and  regular  flow  of  data  in 
the  system,  thus  increasing  its  execution  speed.  However,  a  system  that  is 
independent  of  B  and  k  may  be  obtained  if  we  allow  for  more  flexible  cells  and, 
accordingly,  partition  the  computation  within  each  unit  such  that  each  cell  is 
assigned  to  a  larger  share  of  the  computation.  The  execution  tine  of  the  dif¬ 
ferent  units  in  the  modified  system  should  be  kept  approximately  equal,  fc  did 
not  discuss  such  a  decoposition  in  this  paper. 

The  common  problem  of  communicating  data  at  a  high  rate  to  and  from  sys¬ 
tolic  arrays  are  not  present  in  the  system  suggested  here.  Namely,  dat3  are 


supplied  to  the  first  unit  in  the  pipe  at  a  rate  of  3<*10  Items  every  3k  tine 
units,  and  results  are  collected  fron  the  last  unit  at  an  average  rate  of  B/jj 
item  every  3<  time  units.  The  communication  between  the  other  pipe-jnit3  do  not 


require  any  intervention  from  the  host  that  controls  the  entire  operation. 

Finally,  we  hope  that  this  work  will  lead  to  more  research  for  the  further 
exploration  of  the  idea  of  pipelining  finite  element  computations. 
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